In GPU acceleration, we must abandon the "compute-first" mindset. Modern performance is governed by memory management: the orchestration of data allocation, synchronization, and optimization between the host (CPU) and the device (GPU).
1. The Memory-Compute Gap
Although GPU arithmetic throughput (TFLOPS) has surged, memory bandwidth (GB/s) has grown far more slowly. The resulting gap leaves execution units frequently "starved," waiting for data to arrive from video memory (VRAM). For this reason, GPU programming is usually memory programming.
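To make the gap concrete, here is a quick back-of-the-envelope calculation; the peak throughput and bandwidth figures are illustrative round numbers, not any specific GPU's specification:

```python
# Machine-balance calculation with hypothetical round numbers,
# not a specific GPU's datasheet values.
peak_flops = 100e12      # 100 TFLOP/s of arithmetic throughput
bandwidth = 2e12         # 2 TB/s of memory bandwidth

# Machine balance: FLOPs the hardware can perform per byte it can fetch.
machine_balance = peak_flops / bandwidth

print(f"A kernel must perform {machine_balance:.0f} FLOPs per byte loaded "
      f"just to keep the execution units busy.")
```

Any kernel whose arithmetic intensity falls below that ratio will idle the compute units while waiting on memory.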
2. The Roofline Model
This model visualizes the relationship between arithmetic intensity (FLOPs/Byte) and performance. Applications typically fall into one of two regimes:
- Memory-bound: limited by bandwidth (the sloped part of the roof).
- Compute-bound: limited by peak TFLOPS (the flat ceiling).
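The two regimes can be sketched in a few lines of Python; the peak and bandwidth numbers below are hypothetical, and the intensity values are only rough stand-ins for real kernels:

```python
def roofline(intensity_flops_per_byte, peak_flops, bandwidth_bytes_per_s):
    """Attainable performance under the Roofline model:
    the lesser of the bandwidth slope and the compute ceiling."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flops_per_byte)

# Hypothetical device: 100 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK, BW = 100e12, 2e12

# Low intensity (a streaming update, ~0.125 FLOPs/byte): memory-bound,
# attainable performance sits on the sloped part of the roof.
print(roofline(0.125, PEAK, BW) / 1e12, "TFLOP/s")  # 0.25 TFLOP/s

# High intensity (a large matrix multiply, ~200 FLOPs/byte): compute-bound,
# attainable performance hits the flat ceiling.
print(roofline(200, PEAK, BW) / 1e12, "TFLOP/s")    # 100.0 TFLOP/s
```

The crossover point (here 50 FLOPs/byte) is where the slope meets the ceiling: below it, only more data reuse helps; above it, only more arithmetic throughput does.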
3. The Cost of Data Movement
The primary performance bottleneck is rarely the arithmetic itself; it is the latency and energy cost of moving each byte across the PCIe bus or out of HBM. High-performance programs prioritize where data resides and minimize transfers between host and device.
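A rough comparison shows why a byte's route matters. The bandwidth figures are assumptions (roughly PCIe 4.0 x16 class for the interconnect, HBM-class for device memory), not measured values:

```python
# Back-of-the-envelope cost of moving 1 GB over PCIe versus reading it
# from on-device HBM. Bandwidth figures are illustrative assumptions.
data_bytes = 1e9

pcie_bw = 32e9    # host <-> device interconnect (~PCIe 4.0 x16)
hbm_bw = 2e12     # on-device memory

pcie_time = data_bytes / pcie_bw   # ~31 ms
hbm_time = data_bytes / hbm_bw     # ~0.5 ms

print(f"PCIe transfer: {pcie_time * 1e3:.1f} ms, "
      f"HBM read: {hbm_time * 1e3:.2f} ms, "
      f"ratio: {pcie_time / hbm_time:.1f}x")
```

Under these assumptions the same gigabyte costs over 60x more time crossing the bus than being read locally, which is why residence beats recomputation of transfers.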
QUESTION 1
What is the primary cause of a GPU kernel being 'memory-bound'?
- The clock speed of the GPU cores is too slow.
- The rate of data delivery is slower than the rate of arithmetic execution.
- There are too many threads running in parallel.
- The CPU is faster than the GPU.
✅ Correct: When data cannot be fed to execution units fast enough to keep them busy, the kernel is limited by memory bandwidth.
❌ Incorrect: Memory-bound refers specifically to the bandwidth bottleneck, not core clock speeds.

QUESTION 2
In the context of GPU programming, what does 'Memory Management' involve?
- Only allocating variables on the CPU stack.
- Controlling allocation, synchronization, and optimization of data transfer between host and device.
- Optimizing the cache size of the L1 controller.
- Manually cleaning the GPU registers after every kernel call.
✅ Correct: It is the strategic orchestration of data across the entire hardware hierarchy.
❌ Incorrect: Memory management in HIP/ROCm encompasses the movement and lifecycle of data between Host and Device.

QUESTION 3
Which axis of the Roofline Model represents 'Arithmetic Intensity'?
- Vertical Axis (Y)
- Horizontal Axis (X)
- The slope of the line.
- The area under the curve.
✅ Correct: The X-axis measures FLOPs per Byte, determining where an application sits relative to the bandwidth wall.
❌ Incorrect: The Y-axis represents performance (GFLOPS); the X-axis represents intensity.

QUESTION 4
Why is redundant host-device transfer considered a 'performance tax'?
- It consumes GPU registers.
- Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.
- It increases the floating-point precision error.
- It causes the GPU to overheat instantly.
✅ Correct: Data movement is often the most expensive operation in terms of both time and power.
❌ Incorrect: Data movement doesn't affect math precision; it affects performance and power efficiency.

QUESTION 5
If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?
- The math instructions are too complex.
- Inefficient orchestration of data residence causing the GPU to wait for data.
- The GPU has too much VRAM.
- The kernel was written in C++ instead of Python.
✅ Correct: Stalls usually indicate the compute units are idle while waiting for high-latency memory transactions.
❌ Incorrect: Complex math would make a kernel compute-bound, not necessarily cause 95% idle stalls.

Case Study: The Climate Simulation Bottleneck
Optimizing a Fluid Dynamics Kernel
A research team is running a massive climate simulation. Their HIP kernel theoretically delivers high TFLOPS, but profiling shows the GPU spends 95% of its time stalled. The team currently transfers data from Host to Device at every time-step.
Q
Why does transferring data at every time-step likely cause the 95% stall?
Solution:
The PCIe bottleneck: The time taken to move data between Host RAM and Device VRAM via the interconnect is orders of magnitude slower than the kernel execution, forcing the GPU to wait (stall) for the next set of data.
Q
Based on the axiom 'GPU programming is memory programming,' what should the team's first optimization step be?
Solution:
Strategic orchestration of data residence: The team should keep data on the GPU across multiple time-steps and only transfer results back to the host when necessary, minimizing 'redundant' transfers.
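The recommendation above can be sketched as a simple timing model in Python (no real HIP calls; the step count and per-step costs are illustrative assumptions, not profiled numbers):

```python
# Timing model contrasting per-step host-device transfers with keeping
# data resident on the device. All costs are illustrative assumptions.
STEPS = 1000
KERNEL_MS = 0.5       # per-step kernel execution time (assumed)
TRANSFER_MS = 31.0    # per-step host<->device copy over PCIe (assumed)

# Naive pattern: copy data across PCIe before every time-step.
naive_ms = STEPS * (TRANSFER_MS + KERNEL_MS)

# Resident pattern: one initial upload, all steps run on-device,
# then a single download of the final result.
resident_ms = TRANSFER_MS + STEPS * KERNEL_MS + TRANSFER_MS

print(f"naive: {naive_ms:.0f} ms, resident: {resident_ms:.0f} ms")

# Fraction of the naive run spent waiting on transfers rather than computing.
stall_fraction = (STEPS * TRANSFER_MS) / naive_ms
print(f"time spent waiting on transfers: {stall_fraction:.0%}")
```

Under these assumed costs the naive pattern spends roughly 98% of its wall-clock time on transfers, which is consistent with the kind of stall profile the team observed.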